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Abstract 

■ Constrained least-squares regression problems, such as the Nonnegative Least Squares 

(713 \ (NNLS) problem, where the variables are restricted to take only nonnegative values, often 

O . arise in applications. Motivated by the recent development of the fast Johnson-Lindestrauss 

transform, we present a fast random projection type approximation algorithm for the NNLS 
^SJ ■ problem. Our algorithm employs a randomized Hadamard transform to construct a much 

\ smaller NNLS problem and solves this smaller problem using a standard NNLS solver. We 

prove that our approach finds a nonnegative solution vector that, with high probability, is 
close to the optimum nonnegative solution in a relative error approximation sense. We ex- 



I perimentally evaluate our approach on a large collection of term-document data and verify 



that it does offer considerable speedups without a significant loss in accuracy. Our analysis 
is based on a novel random projection type result that might be of independent interest. In 
\ particular, given a tall and thin matrix $ G E"^'' (n 3> c?) and a vector y G , we prove that 

. the Euclidean length of $y can be estimated very accurately by the Euclidean length of 

where <& consists of a small subset of (appropriately rescaled) rows of <&. 

•T-H 

X 

H ; 1 Introduction 

, 

The Nonnegative Least Squares (NNLS) problem is a constrained least-squares regression problem 
where the variables are allowed to take only nonnegative values. More specifically, the NNLS 
problem is defined as follows: 

Definition 1 [Nonnegative Least Squares (NNLS)] 

Given a matrix A G R"^'^ and a target vector b G M" , find a nonnegative vector Xopt G such 
that 

Xopt = arg min - ftjlg • (1) 

NNLS is a quadratic optimization problem with linear inequality constraints. As such, it is a 
convex optimization problem and thus it is solvable (up to arbitrary accuracy) in polynomial 
time |4j. In words, NNLS seeks to find the best nonnegative vector Xopt in order to approximately 
express 6 as a strictly nonnegative linear combination of the columns of A, i.e., b ~ Axopt- 

The motivation for NNLS problems in data mining and machine learning stems from the fact 
that given least-squares regression problems on nonnegative data such as images, text, etc., it 
is natural to seek nonnegative solution vectors. (Examples of data applications are described 
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in [6].) NNLS is also useful in the computation of the Nonnegative Matrix Factorization [16j . 
which has received considerable attention in the past few years. Finally, NNLS is the core 
optimization problem and the computational bottleneck in designing a class of Support Vector 
Machines [22|. Since modern datasets are often massive, there is continuous need for faster, more 
efficient algorithms for NNLS. 

In this paper we discuss the applicability of random projection algorithms for solving con- 
strained regression problems, and in particular NNLS problems. Our goal is to provide fast 
approximation algorithms as alternatives to the existing exact, albeit expensive, NNLS methods. 
We focus on input matrices A that are tall and thin, i.e., n ^ d, and we present, analyze, and 
experimentally evaluate a random projection type algorithm for the nonnegative least-squares 
problem. Our algorithm utilizes a novel random projection type result which might be of in- 
dependent interest. We argue that the proposed algorithm (described in detail in Section [3]), 
provides relative error approximation guarantees for the NNLS problem. Our work is motivated 
by recent progress in the design of fast randomized approximation algorithms for unconstrained 
ip regression problems [7j . 

The following theorem is the main quality-of-approximation result for our randomized NNLS 
algorithm. 

Theorem 1 Let e € (0, 1]. Let A G M"^*^ and b G M" be the inputs of the NNLS problem with 
n ^ d. If the input parameter r of the RandomizedNNLS algorithm of SectionlM satisfies 

r ^ 342c2(d+l)log(n) 



(for a sufficiently large constant Cojj then the RandomizedNNLS algorithm returns a nonneg- 
ative vector Xopt such that 

\\Axopt - b\\l < {I + e) min H^x-^Ho, (3) 

holds with probability at least 0.^. The running time of the RandomizedNNLS algorithm is 

0{nd\og{r))+TNNLs{r,d). (4) 

The latter term corresponds to the time required to exactly solve an NNLS problem on an input 
matrix of dimensions r x d. 

One should compare the running time of our method to Tattvls in,d), which corresponds to the 
time required to solve the NNLS problem exactly. We experimentally evaluate our approach on 
3,000 NNLS problems constructed from a large and sparse term-document data collection. On 
average (see section 14. ip , the proposed algorithm achieves a three- fold speedup when compared 
to a state-of-the-art NNLS solver |15j with a small (approx. 10%) loss in accuracy; a two-fold 
speedup is achieved with a 4% loss in accuracy. Computational savings are more pronounced for 
NNLS problems with denser input matrices A and vectors b (see section [42]) . 

The remainder of the paper is organized as follows. Section [2] reviews basic linear algebraic 
definitions and discusses related work. In Section [3] we present our randomized algorithm for 
approximating the NNLS problem, discuss its running time, and give the proof of Theorem [H 
Finally, in section [J] we provide an experimental evaluation of our method. 

^Co is an unspecified constant in [19) . 

^Note that a small number of repetitions of the algorithm suffices to boost its success probability. 
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2 Background and related work 



Let [n] denote the set {1, 2, . . . , n}. For any matrix A G M"-^" with n > d let ^(j), « G [n] denote 
the i-th row of A as a row vector, and let A^^\j € [d] denote the j'-th column of j4 as a column 
vector. The Singular Value Decomposition (SVD) of A can be written as 



A 



(5) 



Assuming that A has full rank, Ua G M"^'^ and Va € W^^'^ are orthonormal matrices, while 
is a (i X d diagonal matrix. Finally, = X]"=i S^=i denotes the square of the Frobenius 



norm of A and ||^||2 = sup^-gj^d^ ^._^q ||^x||2 / ||a^||2 denotes the spectral norm of A. 

The (non- normalized) n x n matrix of the Hadamard- Walsh transform is defined recursively 
as follows: 



H. 



n/2 
n/2 



n/2 



-H. 



n/2 



with 



+1 
+1 



+1 
-1 



The n X n normalized matrix of the Hadamard- Walsh transform is equal to -^Hn] hereafter, we 
will denote this normalized matrix by Hn [n is a power of 2). For simplicity, throughout this 
paper we will assume that n is a power of two; padding A and b with all-zero rows suffices to 
remove the assumption. Finally, all logarithms are base two. 



2.1 Random projection algorithms for unconstrained problems 

The unconstrained least-squares regression problem (£2) takes as input a matrix A G M"^'^ and a 
vector b G M", and returns as output a vector Xopt G '^'^ that minimizes the distance \\Ax — b\\^. 
Assuming n ^ d, various algorithms solve the problem exactly in 0{nd'^) time |14j . Drineas 
et al. [Hi [TO] and Sarlos [21] give randomized algorithms to approximate the solution to such 
problems. The basic idea of these algorithms is to select a subset of rows from A and a subset of 
elements from 6, and solve the induced problem exactly. The fundamental algorithmic challenge 
is how to form the induced problem. It turns out that sampling rows of A and elements of b 
with probabilities that are proportional to the £2 norms of the rows of the matrix of the left 
singular vectors of j4 suffices [9j. This approach is not computationally efficient, since computing 
these probabilities takes 0{nd'^) time. However, by leveraging the Fast Johnson-Lindenstrauss 
Transform of [2j, one can design an o{nd'^) algorithm for this problem [10]. The algorithm of this 
paper applies the same preconditioning step as the main Algorithm of [lOj. The analysis though 
is very different from the analysis of [TO] and is based on a novel random projection type result 
that is presented in section [3l The difficulty of applying the analysis of [lOj here is the fact that 
the solution of an NNLS problem cannot be written in a closed form. Finally, it should be noted 
that similar ideas are discussed in [7], where the authors present sampling-based approximation 
algorithms for the ip regression problem for p = [I, 00). The preconditioning step of the algorithm 
of this paper is different from the preconditioning step of the algorithm of [7j. 



2.2 Algorithms for the NNLS problem 

We briefly review NNLS algorithms following the extensive review in Recall that the NNLS 
Problem is a quadratic optimization problem. Hence, all quadratic programming algorithms 
may be used to solve it. Methods for solving NNLS problems can be divided into three general 
categories: (i) active set methods, (u) iterative methods, and (iii) other methods. The approach 
of Lawson and Hanson in [18] seems to be the first technique to solve NNLS problems. It is a 
typical example of an active set method and is implemented as the function Isqnonneg in Matlab. 
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Immediate followups to this work include the technique of Bro and Jong [5] which is suitable for 
problems with multiple right hand sides, as well as the combinatorial NNLS approach of Dax [8]. 
The Projective Quasi-Newton NNLS algorithm of [15j is an example from the second category. 
It is an iterative approach based on the Newton iteration and the efficient approximation of the 
Hessian matrix. Numerical experiments in [TS] indicate that it is a very fast alternative to the 
aforementioned active set methods. The sequential coordinate-wise approach of [12] is another 
example of an iterative NNLS method. Finally, interior point methods are suitable for NNLS 
computations ^20j. A different approach appeared in [22j. It starts with a random nonnegative 
vector X G M*^ and updates it via elementwise multiplicative rules. Surveys on NNLS algorithms 
include [iims]. 



3 A Random Projection Type Algorithm for the NNLS problem 

This section describes our main algorithm for the NNLS problem. Our algorithm employs a 
randomized Hadamard transform to construct a much smaller NNLS problem and solves this 
smaller problem exactly using a standard NNLS solver. The approximation accuracy of our 
algorithm is a function of the size of the small NNLS problem. 



Vopt = arg mm 



3.1 The RandomizedNNLS Algorithm 

Algorithm RandomizedNNLS takes as inputs an n x d matrix A {n ^ d), an n-dimensional 
target vector b, and a positive integer r < n. It outputs a nonnegative d-dimensional vector Xopt 
that approximately solves the original NNLS problem. Our algorithm starts by premultiplying 
the matrix A and the right hand side vector b with a random n x n diagonal matrix D, whose 
diagonal entries are set to +1 or —1 with equal probability. It then multiplies the resulting matrix 
DA and the vector Db with a small submatrix of the nx n normalized Hadamard- Walsh matrix 
Hn (see section [2]). This submatrix of - denoted by -ff - is constructed as follows: for all 
i G [n], the i-th row of is included in H with probability r/n. Clearly, the expected number 
of rows of the matrix H is equal to r. Finally, our algorithm returns the nonnegative vector 
Xopt £ R'^ that satisfies 

HD{Ax-b) ^ (6) 

In section [331 we will argue that, for any e G (0, 1/3], if we set 

r > 684c^(d+ l)log(n)log(342c^(d+ l)log(n)/e2)/e^ (7) 

then II Aiopt — ^||2 is at most (1+e) worse than the true optimum || Axopi — 6||2- This is a sufficient 
(but not necessary) condition for r in order to satisfy the relative error guarantees of equation ([3]) . 
Indeed, in the experiments of section |4] we will argue that empirically a much smaller value of r, 
for example r = d + 20, suffices. 

3.2 The proof of Theorem [1] 

3.2.1 A random projection type result 

In this section we prove a random projection type result based on the so-called subspace sampling 
procedure [11] that might be of independent interest. In particular, given a matrix <I> G W^^'^ 
with n ^ d (a.k.a., $ is tall and thin), and any vector y G M'^, we argue that the £2 norm of 
$y can be estimated very accurately by where $ consists of small subset of (appropriately 
rescaled) rows of 
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Inputs: A G M"^*^, b G M", positive integer r < n. 
Output: a nonnegative vector Xopt ^ 

1. Let Hn be the n x n normalized Hadamard- Walsh matrix. 

2. Let S* be an n X n diagonal matrix such that for all i € [n], 



Si; 



syn/r, with probability r/n 
0, otherwise 



3. Let H be the matrix consisting of the non-zero rows of SHn- 
(Notice that H has - in expectation - r rows.) 

4. Construct the n x n diagonal matrix D such that, for all i G [n] 
Da = +1 with probability 1/2; otherwise Da = — 1. 

5. Solve 

_ 2 

Xopf = arg min HDAx — HDb , 

xm'^,x>0 2 

using any standard NNLS solver and return the vector Xopt- 



Algorithm 1: The RandomizedNNLS algorithm. 

More specifically, consider the SubspaceSampling algorithm described below. This algo- 
rithm selects a small subset of rows of <I> to construct notice that $ has - in expectation - at 
most r rows. Also notice that ^ contains the i-th row of $ (appropriately rescaled) if and only if 
Sii is non-zero. Lemma [J bounds the approximation error for our subspace sampling algorithm. 

Lemma 1 Let e € (0, 1]. Let $ be an n x d matrix (n ^ d), C/$ be the n x d matrix containing 
the left singular vectors of ^, and U^i^i) denote the i-th row ofU^. Let $ be constructed using the 
SubspaceSampling algorithm with inputs <I>, r, and sampling probabilities pi. If for all i € [n], 

Pi>P\\U^i4l/d, (8) 
for some (3 € (0, 1], and the parameter r satisfies 

> M^, (9) 
log(r) /3e2 

for a sufficiently large constant Cq, then with probability at least 2/3 all d-dimensional vectors y 
satisfy, 



<Am\l- (10) 



Proof: Let ^ = Uq,T,^V^ be the SVD of $ with C/$ G W""^, S$ G M^^'^, and G W^""^. Let 5 
be the nxn diagonal matrix constructed at the first step of algorithm SubspaceSampling, and 
let U'^S'^ SU^ = I + E, where / is the d x d identity matrix, and E some d x d matrix. Then, 
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Input: <I> G M"^"^, integer r < n, set of probabilities Pi > 0, i £ [n] s.t. J2i<^[n] Pi — ^^ 
Output: «! e W'""^, with E (f) < r. 

1. Let 5 be the n x n diagonal matrix such that for all i £ [n], 

g _ i 1/ y^m.m{l,rpi} ,with probability mm{l, rpi} 
\ ,otherwise 

2. Let ^ be the matrix consisting of the non-zero rows of 5$. 
(Notice that $ has - in expectation - r rows.) 



Algorithm 2: SubspaceSampling algorithm 



using these two definitions, submultiplicativity, and the orthogonality and unitary invariance of 
and V$, 



\m\l 



= \y'^V^J:<s,EE^Viy\ 

< \\y'^Vq>T.^\\^ \\E\\2 \\T.^V^y\\^ 

II T l|2 

= ||£'||2 

II T l|2 

= ||£'||2 y\\^ 

= 11^112 Il^yll2- 

Using Theorem 7 of [TT] (originally proven by Rudelson and Virshynin in [19]), we see that 



Bi\\E\\^)<Coi 



'log(r) 
(3r 



\U<s,\\2 \\UcS,\\p = Co] 



I dlog{r) 



(3r 



(11) 



for a sufficiently large constant Cq (cq is not specified in [H]). Markov's inequality implies that 

ll^ll.<3../^. (12) 
with probability at least 2/3. Finally, using equation ([9]) concludes the proof of the lemma. 
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3.2.2 Another useful result 



Results in [2] imply the following lemma. 

Lemma 2 Let U be an n x d orthogonal matrix (n > 20 and n> d). Then, for all i E [n], 



{HnDU) 



(0 



< 



4.2(i log n 



n 



(13) 



holds with probability at least 0.9. 



3.2.3 The proof of Theorem [T] 

We are now ready to prove Theorem [H We apply lemma [T] for $ = [ HnDA —HnDb ] € 
j^nx{d+i)^ the parameter r of Theorem [H sampling probabilities pi = 1/n, for all i G [n], /? = 
1/ (4.21ogn), and e' = e/3 G (0,1/3], where e G (0,1] is the parameter of Theorem [TJ Let C/$ 
be the n x (d + 1) matrix of the left singular vectors of <I>. Note that is exactly equal to 
[/$ = HnDU^j^ _;,], where ?7[^ _6] is the n x (d + 1) matrix of the left singular vectors of [A — b]. 
Lemma [2] for C/[^ and our choice of (3, guarantee that for all i € [n], with probability at least 
0.9 

1/n >^\\{H„DU[A-b])^,)\\l/d^l/n > /3 || ([/<!,)(,) |[ /d. 

The latter inequality implies that assumption ([8]) of Lemma [1] holds. Also, our choice of r, 
which satisfies inequality ([2|) in Theorem [H our choice of (3, and our choice of e, guarantee that 
assumption ([9]) of Lemma [1] is also true. Since all d-dimensional vectors y satisfy equation (jlOp . 

Xopt 



we pick y 



Xopt 
1 



and y 



1 



, thus getting that with probability at least 2/3: 



(1-e') WHnDAxopt - HnDbWl < HnDAxopt - HnDb ^ < (1 + e') \\HnDAxopt - HnDb\\l , (14) 
and 

(1-e') WHnDAxopt - HnDbWl < j^HnDAxopt - HnDb\^^ < (1 + e') \\HnDAxopt - HnDb\\l . (15) 
Manipulating equations (fH|) and (fT5|) we get 



\H„DAx. 



opt 



HnDb\\\ 



< 



< 



< 



HnDAiopt - HnDb 



1 



1-e' 
1 + e' 



HnD AXqpI 



HnDb 



1-e' 
< (l + 3e' 



- WHnDAXopt - HnDbW 



\HnD AXgpt 



HnDb\\t 



< (1 + e) WHnDAxopt- HnDbWl. 

The second inequality follows since Xopt is the optimal solution of the NNLS problem of eqn. ([6]), 
thus Xopt is a sub-optimal solution, and the fourth inequality follows since (1 + e') / (1 — e') < 
1 + 3e', for all e' G (0, 1/3]. In the last inequality, we set e' = e/3. To conclude the proof, notice 
that HnD is an orthonormal square matrix and can be dropped without changing a unitarilly 
invariant norm. Finally, since Lemmas [1] and [2] fail with probability at most 1/3 and 1/10 
respectively, the union bound implies that Theorem [T] fails with probability at most 0.5. 
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3.3 What is the minimal value of r ? 



To derive values of r for which the RandomizedNNLS algorithm satisfies the relative error 
guarantees of Theorem [H we need to solve equation ([2]); this is hard since the solution depends 
on the Lambart W function. Thus, we identify a range of values of r that are sufficient for our 
purposes. Using the fact that for any a > 4, and for any 7 > 2alog(a), 

7 . 
log(7) - 

and by setting a = 3i2cl{d + 1) \og{n)/e'^ in equation ^ (note that 34:2cl{d + 1) log{n)/e'^ > 4), 
it can be proved that every r such that 

r > 684c^(d+ l)log(n)log(342c^((i+ l)log(n)/e2)/e^ (16) 

satisfies the inequality [2] (co is the constant of Theorem [1]). 

3.4 Running time analysis 

In this subsection we analyze the running time of our algorithm. Let r be the minimal value 
that satisfies equation (I16p . First, computing DA and Db takes 0{nd) time. Since H has in 
expectation r rows, Ailon and Liberty in [3j argue that the computation of HDA and HDb takes 
O(radlogr) time. For our choice of r, this is 

Tprecond = 0(nd log(d log(n)/e^) ) . 

After this preconditioning step, we employ an NNLS solver on the smaller problem. The compu- 
tational cost of the NNLS solver on the small problem was denoted as T/vArLs(^; d) in Theorem [TJ 
TNNLsii^jd) cannot be specified exactly since theoretical running times for exact NNLS solvers 
are unknown. In the sequel we comment on the computational costs of some well defined segments 
of some NNLS solvers. 

The NNLS formulation of Definition [T] is a convex quadratic program, and is equivalent to 

min x^Qx — 2q^x, 

x&R'',x>0 

where Q = A^A S M.'^^'^ and q = A^b G M*^. Computing Q and q takes 0{nd^) time, and then 
the time required to solve the above formulation of the NNLS problem is independent of n. Using 
this formulation, our algorithm would necessitate Tprecond time for the computation of HDA (the 

preconditioning step described above), and then Q = (^HDA^ HDA and q = (^HDA^ b can 

be computed in Tmm = 0{r(f') time; given our choice of r, this implies 

TMM = 0{dHog{n)/e^). 

Overall, the standard approach would take 0{nd'^) time to compute Q, whereas our method 
would need only Tprecond + Tmm time for the construction of Q. Note, for example, that when 
n = 0{d'^) and regarding e as a constant, Q can be computed 0{d/ log{d)) times faster than Q. 

On the other hand, many standard implementations of NNLS solvers (and in particular those 
that are based on active set methods) work directly on the formulation of Definition [TJ A typical 
cost of these implementations is of the order 0{nd^) per iteration. Other approaches, for example 
the NNLS method of [15j, proceed by computing matrix- vector products of the form Au, for an 
appropriate d-dimensional vector u, thus cost typically 0{nd) time per iteration. In these cases 
our algorithm needs again Tprecond preprocessing time, but costs only 0{rd'^) or 0{rd) time per 
iteration, respectively. Again, if given our choice of r, the computational savings per iteration are 
comparable with the 0{d/ log{d)) speedup described above. 
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4 Experimental Evaluation 



In this section, we experimentally evaluate our RandomizedNNLS algorithm on (i) large, sparse 
matrices from a text-mining application, and (ii) random matrices with varying sparsity. We 
mainly focus on employing the state-of-the-art solver of [15] to solve the small NNLS problem. 

4.1 The TechTCSOO dataset 

Our data come from the Open Directory Project (ODP) [1], a multilingual open content directory 
of WWW links that is constructed and maintained by a community of volunteer editors. ODP uses 
a hierarchical ontology scheme for organizing site listings. Listings on similar topics are grouped 
into categories, which can then include smaller subcategories. Gabrilovich and Markovitch con- 
structed a benchmark set of 300 term-document matrices from ODP, called TechTCSOO (Technion 
Repository of Text Categorization Datasets j.13]), which they made publicly available. Each term- 
document matrix of the TechTCSOO dataset consists of a total of 150 to 400 documents from two 
different ODP categories, and a total of 15,000 to S5,000 terms. We chose this dataset because 
we believe that it does represent an important application area, namely text mining, and we do 
believe that the results from our experiments will be representative of the potential usefulness of 
our randomized NNLS algorithm in large, sparse, term-document NNLS problems. 

We present average results from S,000 NNLS problems. More specifically, for each of the 
300 matrices of the TechTCSOO dataset, we randomly choose a column from the term-document 
matrix as the vector b, we assign the remaining columns of the same term-document matrix 
to the matrix A, and solve the resulting NNLS problem with inputs A and b. We repeat this 
process ten times for each term-document matrix of the TechTCSOO dataset, and thus solve a 
total of S,000 problems. Whenever an NNLS routine is called, it is initialized with the all-zeros 
vector. We evaluate the accuracy and the running time of our algorithm when compared to two 
standard NNLS algorithms. The first one is described in |l5jl, and the second one is the active 
set method of |18j . implemented as the built-in function Isqnonneg in Matlab. We would also 
like to emphasize that in [15] the authors compare their approach to other NNLS approaches and 
conclude that their algorithm is significantly faster. Note that the method of [15] operates on 
the quadratic programming formulation discussed in Section IS.fl while Isqnonneg operates on 
the formulation of Definition [H Finally, we implemented our RandomizedNNLS algorithm in 
Matlab. The platform used for the experiments was a 2.0 GHz Pentium IV with 1GB RAM. 

Our (average) results are shown in Figure [TJ We only focus on the algorithm of [15] , which 
was significantly faster, running (on average) in five seconds, compared to more than one minute 
for the Isqnonneg function. We experimented with eight different values of the parameter r, 
which dictates the size of the small subproblem (see the RandomizedNNLS algorithm). More 
specifically, we set r to d + i ■ 50, for i = 1 . . . 8, where d is the number of columns in the matrix 
A. Our results verify that (i) the RandomizedNNLS algorithm is very accurate, (ii) that it 
reduces the running time of the NNLS method of [15], and (iii) that there exists a natural tradeoff 
between the approximation accuracy and the number of sampled rows. Notice, for example, that 
the running time of the state-of-the-art NNLS solver of \15\ can be reduced from two to three 
times, while the residual error is from 4% up to 10% worse than the optimal residual error. 

We briefly comment on the performance of RandomizedNNLS when compared to the Isqnon- 
neg algorithm. As expected, the accuracy results are essentially identical with the method of |15j . 
since both methods solve the NNLS problem exactly. Our speedup, however, was much more sig- 

^We would like to thank the authors of [15] for providing us with a Matlab implementation of their algorithm. 
*The actual implementation involves computations of the form t = Au and s = A^t, avoiding the computation 
and storage of the matrix A''" A. 
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Figure 1: Average results of the RandomizedNNLS algorithm compared to the algorithm 
of [15j on 3000 NNLS problems constructed from the TechTCSOO dataset. Relative Error := 
\\Axopt — b\\2 / W-^^opt — ^\21 while Overall Time:= WOTx^pjTx^pf Xopt is computed with the 
method of [15] in Tx^pf time, and Xopt 

with the RandomizedNNLS algorithm (the last step 
employs the method of [ISj) in Tj.^^^ time. Points one through eight on the x-axis of the plots 
correspond to values of the parameter r = d + i ■ 50 for i = 1 ... 8, where d is the number of 
columns of A. On the right panel, Preprocessing time stands for the cost of the multiplication of 
A and b with HD (multiplied by 100 and divided by Tx^^t)-, and Small-problem time stands for 
the cost of solving the small problem using the algorithm of |T5] (multiplied by 100 and divided 
by Tx^pf). For each point of the x-axis: Overall time = Preprocessing time + Small-problem time. 

nificant, ranging from 14-fold to 10-fold for r = d + 50 and r = d + 400 respectively (data not 
shown) . 

4.2 Sparse vs dense NNLS problems 

The astute reader might notice that we evaluated the performance of our algorithm in a rather 
adversarial setting. The TechTCSOO data are quite sparse, hence existing NNLS methods would 
operate on sparse matrices. However, our preprocessing step in the RandomizedNNLS algorithm 
destroys the sparsity, and the induced subproblem becomes dense. Thus, we are essentially 
comparing the time required to solve a sparse, large NNLS problem to the time required to solve 
a dense, small NNLS problem. If the original problem were dense as well, we would expect 
more pronounced computational savings. In this section we experiment with random matrices of 
varying density in order to confirm this hypothesis. 

First, it is worth noting that the sparsity of the input matrix A and/or the target vector b 
do not seem to affect the approximation accuracy of the RandomizedNNLS algorithm. This 
should not come as a surprise since our results in Theorem [1] do not make any assumptions on 
the inputs A and b. Indeed, our experiments in Figure 2 confirm our expectations. 

Prior to discussing our experiments on random matrices of varying density, it is worth noting 
that the NNLS solver of [T5] has a running time that is a function of the number of non-zero entries 
of A. Indeed, the method of [15j is an iterative method where the computational bottleneck in 
the j-th iteration involves computations of the form A^Au, for a d-dimensional vector u. |15| 
implemented their algorithm by computing the two matrix-vector products Au and i^u) 
separately, thus never forming the matrix A^A and thus taking advantage of the sparsity of A. 
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Figure 2: Average results of the RandomizedNNLS algorithm compared to the algo- 
rithm of [T5| on six sets of 100 NNLS problems with different density. Relative Er- 
ror:= \\Axopt — b\\2 / \\Axopt — b\\2, while Overall Time:= WOTx^pjTx^pf Xopt is computed 
with the method of [15] in Tx^p^ time, and Xopt with the RandomizedNNLS algorithm 
(the last step employs the method of [15]) in Tj^pt time. For the six density parameters 
(2%, 4%, 8%, 16%, 32%, 64%), T^.^^, was on average (0.26 sec, 0.40 sec, 0.94 sec, 2.89 sec, 6.27 
sec, 10.57 sec), respectively. 

Indeed, NNLS problems with sparse coefficient matrices A are solved faster than NNLS problems 
with similar-size dense coefficient matrices A by using the method of [15^ 

In order to measure how the speedup of our RandomizedNNLS algorithm improves as the 
matrix A and vector b become denser, we designed the following experiment. First, let the 
density of an NNLS problem denote the percentage of non-zero entries in A and b; for example, 
density{A, b) = 10% means that approximately 0.9(n(i + n) entries in the n x d matrix A and 
the n X 1 vector b are zero. We chose six density parameters (2%, 4%, 8%, 16%, 32%, and 
64%) and generated 100 NNLS problems for each density parameter. More specifically, we first 
constructed ten n x {d + 1) random matrices with the target density (the non-zero entries are 
normally distributed in [0, 1]). Then, for each matrix, we randomly selected one column to form 
the vector b and assigned the remaining d columns to the matrix A. We repeated this selection 
process ten times for each of the ten n x (d+l) matrices, thus forming a set of 100 NNLS problems 
with inputs A and b. We fixed the dimensions to n = 10, 000 and d = 300 and we experimented 
with four values of r = {d, d + 50,d+W0, d+150). In Figure 2 we present average results over the 
100 NNLS problems for each choice of the density parameter. Notice that increasing the density 
of the inputs A and b, the computational gains increase as well. On top of that, our method 
becomes more accurate while the number of the zero entries in A and b become fewer. Notice for 
example, on the right plot of Figure 2, when r = 300, the two extreme cases {density = 2% and 
density = 64%) correspond to an 18% and a 4% loss in accuracy, respectively. Given these two 
observations as well as the actual times of Tx^p^ (see the caption of Figure 2), we conclude that 
the random projection ideas empirically seem more promising for dense rather than sparse NNLS 
problems. 

^The authors of [15] performed extensive numerical experiments to verify that observation; for example see the 
last row of Table 4 on page 14 in [15] and notice that the running time of their method increases as the density of 
A increases. 
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5 Conclusions 



We presented a random projection algorithm for the Nonnegative Least Squares Problem. We 
experimentally evaluated our algorithm on a large, text-mining dataset, and verified that, as 
promised in our theoretical findings, practically it does give very accurate approximate solutions, 
while outperforming two standard NNLS methods in terms of computational efficiency. Future 
work includes the extension of our theoretical findings of Theorem 1 to NNLS problems with 
multiple right hand side vectors. An immediate application of this would be the computation of 
Nonnegative Matrix Factorizations based on Alternating Least Squares type approaches [16\ I17j. 
Finally, notice that, since our analysis is independent of the type of constraints on the vector x, 
our main algorithm can be employed to approximate a least-squares problem with any type of 
constraints on x. 
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